{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcdd Exercise M7.01\n",
    "\n",
    "In this exercise we will define dummy classification baselines and use them as\n",
    "reference to assess the relative predictive performance of a given model of\n",
    "interest.\n",
    "\n",
    "We illustrate those baselines with the help of the Adult Census dataset, using\n",
    "only the numerical features for the sake of simplicity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census-numeric-all.csv\")\n",
    "data, target = adult_census.drop(columns=\"class\"), adult_census[\"class\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, define a `ShuffleSplit` cross-validation strategy taking half of the\n",
    "samples as a testing at each round. Let us use 10 cross-validation rounds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, create a machine learning pipeline composed of a transformer to\n",
    "standardize the data followed by a logistic regression classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the cross-validation (test) scores for the classifier on this dataset.\n",
    "Store the results pandas Series as we did in the previous notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, compute the cross-validation scores of a dummy classifier that constantly\n",
    "predicts the most frequent class observed the training set. Please refer to\n",
    "the online documentation for the [sklearn.dummy.DummyClassifier\n",
    "](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
    "class.\n",
    "\n",
    "Store the results in a second pandas Series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we collected the results from the baseline and the model, concatenate\n",
    "the test scores as columns a single pandas dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Next, plot the histogram of the cross-validation test scores for both models\n",
    "with the help of [pandas built-in plotting\n",
    "function](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#histograms).\n",
    "\n",
    "What conclusions do you draw from the results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the `strategy` of the dummy classifier to `\"stratified\"`, compute the\n",
    "results. Similarly compute scores for `strategy=\"uniform\"` and then the  plot\n",
    "the distribution together with the other results.\n",
    "\n",
    "Are those new baselines better than the previous one? Why is this the case?\n",
    "\n",
    "Please refer to the scikit-learn documentation on\n",
    "[sklearn.dummy.DummyClassifier](\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
    "to find out about the meaning of the `\"stratified\"` and `\"uniform\"`\n",
    "strategies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}